Poverty and the well-being of NYC citizens based on environmental factors
from folium import plugins
from folium.plugins import HeatMap
from matplotlib.ticker import MaxNLocator
import pandas as pd
import folium
import plotly.express as px
import json
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math
import branca
import branca.colormap as cm
import warnings
warnings.filterwarnings('ignore')
!ls /datasets/socialdatadump/DTU/Social/
__dir__ = '/datasets/socialdatadump/DTU/Social/'
air_quality_311 = pd.read_csv(__dir__ + '311_air_quality.csv', parse_dates=True)
tree_data_2005 = pd.read_csv(__dir__ + '2005_Street_Tree_Census.csv')
tree_data_2015 = pd.read_csv(__dir__ + '2015_Street_Tree_Census_-_Tree_Data.csv')
air_quaility_data = pd.read_csv(__dir__ + 'Air_Quality.csv', parse_dates=True)
health_surveys = pd.read_csv(__dir__ + 'New_York_City_Community_Health_Survey.csv')
We are going to analyze New York City data both from a social and an enviromental aspect, by using the previously mentioned data on trees, air pollution, 311 complaints and poverty, to see what effects they have on one another. The observed period of time is between 2005 and 2015, let's jump right into it. First let's have a look at the number of air quality related complaints over the years. But why air quality?
Air pollution is one of the most important environmental threats to urban populations and while all people are exposed, pollutant emissions, levels of exposure, and population vulnerability vary across neighborhoods. Exposures to common air pollutants have been linked to respiratory and cardiovascular diseases, cancers, and premature deaths.
air_quality_311['Created Date'] = pd.to_datetime(air_quality_311['Created Date'])
_air_quality_311 = air_quality_311[air_quality_311['Created Date'].dt.year <= 2015]
air_quality_311_by_year = _air_quality_311.groupby(_air_quality_311['Created Date'].dt.year).size()
plt.rcParams.update({'font.size': 22})
ax = plt.figure(figsize=(20,10)).gca()
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.bar(air_quality_311_by_year.index, air_quality_311_by_year)
plt.title('No. of 311 complaints on air quailty')
plt.xlabel('Year')
plt.ylabel('No. of complaints')
plt.show()
And for completeness' sake let's have a look at it using geospatial analysis.
complaint_map = folium.Map(location=[40.730610, -73.935242], zoom_start=11, tiles="Stamen Toner")
air_quality_311_for_map = air_quality_311.dropna(axis=0, subset=['Longitude','Latitude'])
complaints = [[[row['Latitude'], row['Longitude']] for _, row in air_quality_311_for_map[air_quality_311_for_map['Created Date'].dt.year == year].iterrows()] for year in range(2005,2016)]
complaint_heatmap = plugins.HeatMapWithTime(complaints, index=[x for x in range(2005, 2016)], auto_play=True, radius=5, min_opacity=0.05, max_opacity=0.9)
complaint_heatmap.add_to(complaint_map)
complaint_map
It would be too early to deduce any sort of conclusions from these alone, let alone say that nothing has changed over the years. First let's dive a little bit deeper into what 311 requests are: 311 is a toll-free number that allows people in the District to request assistance with city services and information. Because it is easy to use and easy to remember, 311 can help improve service delivery to residents, workers, and visitors in the nation's capital.
So anybody who asks for assistance or just wants to query information can call this number. What other aspects are contained within the data? Well most often than not the requests are for air quality issues, like odor, fumes, dust and smoke (see the plot below). So does this mean anything?
descriptors = air_quality_311.groupby(air_quality_311['Descriptor']).size().sort_values(ascending=False).head(10)
ax = plt.figure(figsize=(20,5)).gca()
plt.bar(descriptors.index, descriptors)
plt.title('Top10 most common types of air quality issues')
plt.ylabel('No. of occurrences')
plt.xticks(rotation=90)
plt.show()
Health surveys of New york City citizens are available since 2010. We can take a look at the correlation between the number of air quality related 311 requests and the self-reported health status. What we can see is that there is a low negative correlation between them (-0.3611), meaning that an increase in the number of requests slightly influences people's self-reported health negatively.
health_by_prevelance = health_surveys[health_surveys['Year'] <= 2015][health_surveys['Prevelance'].str.contains("Prevalence")]
for_corr_311 = air_quality_311_by_year[air_quality_311_by_year.index >= 2010]
# We need to reverse the list, because the order of the years are reversed
health_by_prevelance.insert(1, 'Number of air quality issues', list(reversed(for_corr_311.values)))
health_by_prevelance.corr(method='pearson')['Number of air quality issues']['Self-reported Health Status (excellent/very good/good)']
Well this doesn't say much. How about we introduced a new kind of dataset? Instead of looking at reports let's look at actual measurements of NYC. The air quality dataset contains information on New York City air quality surveillance data. These indicators provide a perspective across time and NYC geographies to better characterize air quality and health.
So how about different chemical concentrations in the air? Two pollutants, particulate matter and ground-level ozone, are of particular health concern, so we'll reduce the area of observations to only these two metritcs.
with open(__dir__ + 'community_districts.geojson') as f:
districts = json.load(f)
# round off, to reduce file size
for i in range(0, len(districts["features"])):
for j in range(0,len(districts["features"][i]['geometry']['coordinates'])):
try:
districts["features"][i]['geometry']['coordinates'][j] = np.round(np.array(districts["features"][i]['geometry']['coordinates'][j]),3)
except:
print(i,j)
aspects = [
'Fine Particulate Matter (PM2.5)',
'Ozone (O3)'
]
all_map_data = {}
air_quaility_data['Start_Date'] = pd.to_datetime(air_quaility_data['Start_Date'])
for aspect in aspects:
data = air_quaility_data[air_quaility_data['Name'] == aspect][air_quaility_data['Geo Type Name'] == 'CD']
map_data = pd.DataFrame(columns=['cd',aspect,'year'])
map_data['cd'] = data['Geo Join ID']
map_data[aspect] = data['Data Value']
map_data['year'] = data['Start_Date'].dt.year
all_map_data[aspect] = map_data
for key, map_data in all_map_data.items():
max_value = map_data[key].max()
fig = px.choropleth_mapbox(map_data, geojson=districts,
locations='cd',
color=key,
color_continuous_scale='aggrnyl' if key == 'Ozone (O3)' else 'agsunset',
range_color=(0, max_value),
featureidkey="properties.BoroCD",
mapbox_style="stamen-toner",
opacity=0.8,
center = {"lat": 40.730610, "lon": -73.935242},
zoom=9,
animation_frame='year',
title=key)
fig.update_geos(fitbounds="locations",visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0},)
fig.show()
What we can see is that fine particulate matter (PM2.5) is rather high in center of manhattan but relatively low in the outskirts of NYC. The opposite can be said of Ozone (O3) levels. Over the years we can see a rather drastic decrease in PM2.5 but not in O3. Contrary to what we have expected, we can see a very much visible decrease in fine particular matter over the years, especially in the center of manhattan where most of the 311 calls happen. But air pollution related 311 requests were at an all time low in 2011 and peaked again in 2015. Ozone levels did increase in 2010 and were at that level until 2013, from that point which they started drop. So what's the deal with this? Can we assume that 311 related calls are unrelated to fine particles and other air toxics? Let's look at their correlation.
start_year = 2009
end_year = 2015
column_name = '311'
df = pd.DataFrame()
for aspect in aspects:
data = air_quaility_data[(air_quaility_data['Name'] == aspect) & (air_quaility_data['Start_Date'].dt.year <= end_year) & ((air_quaility_data['Start_Date'].dt.year >= start_year))].groupby(air_quaility_data['Start_Date'].dt.year)['Data Value'].mean()
df[aspect] = data.to_frame(name=aspect)[aspect]
df[column_name] = air_quality_311[(air_quality_311['Created Date'].dt.year <= end_year) & (air_quality_311['Created Date'].dt.year >= start_year)].groupby(_air_quality_311['Created Date'].dt.year).size().to_frame(name = column_name)[column_name]
df.rename(columns={'Fine Particulate Matter (PM2.5)': 'PM2.5', 'Ozone (O3)': 'O3'}, inplace=True)
corr_airpol = df.corr(method='pearson')
sn.heatmap(corr_airpol, annot=True, cmap='YlGn')
plt.show()
So not unrelated, but there is a mild and moderate negative correlation between air quality and the number of 311 calls. Complitely the opposite of what one would expect. Or is it? It wouldn't make sense to question it were it not for the case that these are micro particles and cannot be seen by the naked eye. Humans are kind of perceptive in a sense that they are only aware of things negatively impacting them if they can actually feel the impact. A person would only report something via the 311 number if they noticed it. Let it be smoke, dust etc etc. Fine particles and Ozone cannot be detected without the capability of measuring them, but they are sure to leave their mark on one's health.
So us, humans, are mostly unaware of this impact, so naturally on the individual's level the problem is rather difficult to solve. Large-scale solutions and raising awareness of these issues are of great importance.
New York City’s air quality has improved recently, as the City and State have worked to lower emissions from regional and local sources. Despite this progress, air pollution remains a leading environmental health threat to all New Yorkers. Those most at risk include older adults, children and people with preexisting health conditions.
So what can we do as a society?

Well, trees, besides indirectly affecting temperature by being very good at shading surfaces, reducing temperatures and thus decreasing risk of harmful pollutants like ground level ozone that commonly spike on hot days in urban areas, are rather potent at removing particulate matter from the air. So more trees equals less particles, right? Well let's see that!
Street tree data from the TreesCount! 2005 and 2015 Street Tree Census, is conducted by volunteers and staff organized by NYC Parks & Recreation and partner organizations. Tree data collected includes tree species, diameter and perception of health. Accompanying blockface data is available indicating status of data collection and data release citywide.
We can check how many more trees there are in certain community districts
tree_map = folium.Map(location=[40.730610, -73.935242], zoom_start=11, tiles="Stamen Toner")
datasets = {'cb_num': tree_data_2005, 'community board': tree_data_2015}
for key, dataset in datasets.items():
data = dataset.groupby(dataset[key]).size().to_frame(name='number').reset_index()
max_value = data['number'].max()
fig = px.choropleth_mapbox(data, geojson=districts,
locations=key,
color='number',
color_continuous_scale='greens',
range_color=(0, max_value),
featureidkey="properties.BoroCD",
mapbox_style="stamen-toner",
opacity=0.8,
center = {"lat": 40.730610, "lon": -73.935242},
zoom=9,
title="Trees in 2005" if key == 'cb_num' else "Trees in 2015")
fig.update_geos(fitbounds="locations",visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0},)
fig.show()
There are two things we can talk about here: there has been an increase in the amount of trees there are all around NYC and that the outer skirts of the city are the dominant ones regarding this number. What is to be expected though, is that if we were to invert the colors of the map it would resemble tha PM2.5 map shown above. Let's check if they actually correlate.
df = tree_data_2015.groupby(tree_data_2015['community board']).size().to_frame(name='Trees').reset_index()
_data = air_quaility_data[air_quaility_data['Name'] == 'Fine Particulate Matter (PM2.5)'][air_quaility_data['Geo Type Name'] == 'CD'][air_quaility_data['Start_Date'].dt.year == 2015]
air_quaility_data_by_cd = _data.groupby(_data['Geo Join ID'])['Data Value'].mean()
df['PM2.5'] = air_quaility_data_by_cd.to_frame(name='pm2.5').reset_index()['pm2.5']
df.rename(columns={'community board': 'District'}, inplace=True)
fig, ax = plt.subplots(figsize=(20,10))
sn.heatmap(df.corr(method='pearson'), annot=True, cmap='YlGn_r',ax=ax)
plt.show()
So there we have